Updating the JIT to handle the FMA hardware intrinsics #18105

tannergooding · 2018-05-23T22:18:23Z

This resolves https://github.com/dotnet/coreclr/issues/18081

tannergooding · 2018-05-23T22:19:36Z

src/jit/lowerxarch.cpp

+        }
+        else
+        {
+            // TODO-XArch-CQ: Technically any one of the three operands can


@CarolEidt, will there be any problems if we set op1, op2, and op3 as regOptional? Will the JIT pick just one, or will it actually try to spill all three?

You have to pick just one. #6361 is the issue that tracks allowing multiple operands to be specified as regOptional.

Thanks! I'm going to update the comment with a link to the bug.

tannergooding · 2018-05-23T22:24:33Z

tests/src/JIT/HardwareIntrinsics/X86/Fma_Vector128/MultiplyAdd.Double.cs

+            {
+                for (var i = 1; i < RetElementCount; i++)
+                {
+                    if (BitConverter.DoubleToInt64Bits(Math.Round((firstOp[i] * secondOp[i]) + thirdOp[i], 13)) != BitConverter.DoubleToInt64Bits(Math.Round(result[i], 13)))


All of the FMA tests do the same thing, but I am going to explicitly call it out here.

We don't have a good way to test FMA today due to the way the operation is executed. Technically, the operation does:

    temp = firstOp * secondOp
    temp = temp + thirdOp
    result = round(temp)

This is in contrast to the two individual operations which do:

    temp = firstOp * secondOp
    result = round(temp)
    temp = result + thirdOp
    result = round(temp)

This is probably the "easiest" solution right now.
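The difference between the two sequences shows up when the exact product falls halfway between two representable doubles. The following is a minimal C++ sketch, not code from this PR: the function names are hypothetical, it assumes IEEE 754 doubles with round-to-nearest-even, and `volatile` is used only to keep the compiler from contracting the separate multiply and add into a fused operation.

```cpp
#include <cassert>
#include <cmath>

// Two roundings: the product is rounded to double before the add.
// `volatile` keeps the compiler from fusing the two operations.
double MultiplyThenAdd(double a, double b, double c)
{
    volatile double product = a * b; // first rounding
    return product + c;              // second rounding
}

// One rounding: std::fma computes a * b + c exactly, then rounds once.
double FusedMultiplyAdd(double a, double b, double c)
{
    return std::fma(a, b, c);
}

// Hypothetical demonstration: a * b is exactly 1 - 2^-54, a tie between
// 1 - 2^-53 and 1.0, which rounds to 1.0 under round-to-nearest-even.
void DemonstrateRoundingDifference()
{
    const double a = 1.0 + std::ldexp(1.0, -27);
    const double b = 1.0 - std::ldexp(1.0, -27);
    const double c = -1.0;

    assert(MultiplyThenAdd(a, b, c) == 0.0);                    // low bits lost
    assert(FusedMultiplyAdd(a, b, c) == -std::ldexp(1.0, -54)); // exact result
}
```

A bit-exact comparison of these two results fails even though each is correct for its own sequence of operations, which is why the tests compare values rounded to 13 digits instead.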

tannergooding · 2018-05-23T22:48:25Z

Also FYI for any reviewers. The tests are auto-generated from a template.

The first commit (~650 Lines) contains the product changes and needs actual review.

The second commit (the remaining 12k lines) contains only test changes and probably does not need in depth review (outside of checking the template input data and looking at one or two tests as an example).

dangi12012 · 2018-05-23T22:58:59Z

I still don't understand why the rounding should be a problem. GCC does use FMA for a*b+c by default.
FMA would make vector dot products much faster and would speed up every JIT-compiled language considerably!

tannergooding · 2018-05-23T23:25:31Z

> I still don't understand why the rounding should be a problem. GCC does use FMA for a*b+c by default.
> FMA would make vector dot products much faster and would speed up every JIT-compiled language considerably!

@dangi12012, Only if you:

Explicitly enable the FMA instruction set (via -mfma)
and
Use -O2 or higher

Additionally, per IEEE 754-2008, automatically transforming (a * b) + c into a fusedMultiplyAdd operation is considered a value-changing optimization. Performing such optimizations automatically can change the output of a given program and introduce subtle, hard-to-diagnose bugs.

If such automatic optimizations were provided, they would need to be behind a switch that users can enable or disable as required.

Edit: It may also be worth mentioning that GCC is the only major compiler doing this (you have to opt-in using -ffast-math or /fp:fast with other compilers): https://godbolt.org/g/6yWntG

tannergooding · 2018-05-24T00:02:12Z

Rebased onto dotnet/master to pickup #18078

Jorenkv · 2018-05-24T09:00:50Z

I think you could test the FMA computation by applying the TwoProduct and GrowExpansion routines from this paper to compute an exact result, then round it to a single/double value: https://people.eecs.berkeley.edu/~jrs/papers/robust-predicates.pdf
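As a rough illustration of the building block being suggested, here is a C++ sketch of a TwoProduct-style error-free multiplication based on Dekker/Veltkamp splitting. It is not code from this PR: the names are hypothetical, it assumes IEEE 754 doubles with round-to-nearest and no overflow or underflow, and it omits the GrowExpansion step and the final rounding that a complete FMA reference would still need.

```cpp
#include <cassert>
#include <cmath>

// Splits a into hi + lo so that each half fits in about 26 significant bits.
// `volatile` keeps the compiler from contracting splitter * a - a.
void Split(double a, double& hi, double& lo)
{
    const double splitter = 134217729.0; // 2^27 + 1
    volatile double c = splitter * a;
    double aBig = c - a;
    hi = c - aBig;
    lo = a - hi;
}

// Computes p = round(a * b) and e such that p + e == a * b exactly.
void TwoProduct(double a, double b, double& p, double& e)
{
    double aHi, aLo, bHi, bLo;
    p = a * b;
    Split(a, aHi, aLo);
    Split(b, bHi, bLo);
    double err1 = p - (aHi * bHi);
    double err2 = err1 - (aLo * bHi);
    double err3 = err2 - (aHi * bLo);
    e = (aLo * bLo) - err3;
}

// Hypothetical self-check: the error term should match what a hardware or
// library FMA reports as the rounding error of the product.
void CheckTwoProductAgainstFma(double a, double b)
{
    double p, e;
    TwoProduct(a, b, p, e);
    assert(e == std::fma(a, b, -p));
}
```

The pair (p, e) carries the exact product without relying on an FMA instruction, so adding the third operand to it and rounding once would yield a reference value to compare the intrinsic against.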

4creators · 2018-05-24T11:21:28Z

> Also FYI for any reviewers. The tests are auto-generated from a template.
>
> The first commit (~650 Lines) contains the product changes and needs actual review.

@tannergooding Why don't you split PR into 2 commits: (i) JIT changes, (ii) Tests - to be consistent with our previous work?

tannergooding · 2018-05-24T13:00:06Z

> Why don't you split PR into 2 commits: (i) JIT changes, (ii) Tests - to be consistent with our previous work?

@4creators, that isn't consistent with my previous HWIntrinsic work and it isn't consistent with standard practice.

Generally speaking, you shouldn't add new product code without corresponding tests exercising said code. As, otherwise, you don't have any validation that your code is correct and functioning properly.

It is completely fine, however, to split the product code and tests into two separate commits. For discoverability, ease of review, etc.

tannergooding · 2018-05-24T13:09:40Z

> I think you could test the FMA computation by applying the TwoProduct and GrowExpansion routines from this paper to compute an exact result, then round it to a single/double value: https://people.eecs.berkeley.edu/~jrs/papers/robust-predicates.pdf

@Jorenkv, thanks for the reference.

There are definitely many ways in which we can "exactly" validate the result. However, this may be more effort than it is worth and I would like to hear from @CarolEidt/@eerhardt before investing significant time in doing that.

The general rule, so far, has been that the HWIntrinsics are contracted to:

emit a particular hardware instruction
not modify the inputs
not modify the outputs

There are of course certain exceptions, such as the "helper" intrinsics (which don't map to any particular instruction) or certain intrinsics (like CompareEqualOrderedScalar) whose instructions return one or more flags, which have to be converted to some form of boolean expression instead.

The current validation logic ensures that we are emitting some form of MultiplyAdd instruction without worrying (too much) about the underlying implementation/behavior of said instruction (we are basically just validating we didn't encode the instruction wrong).

4creators · 2018-05-24T13:57:09Z

> that isn't consistent with my previous HWIntrinsic work and it isn't consistent with standard practice

OK, so as usual you are contradicting yourself just to have a different opinion 😄. AFAIR, you explicitly asked me to submit PRs as two separate commits: (i) commit changing product code -> JIT, (ii) commit adding tests. Should I dig up your specific request?

tannergooding · 2018-05-24T15:00:49Z

> OK, so as usual you are contradicting yourself just to have a different opinion 😄. AFAIR, you explicitly asked me to submit PRs as two separate commits: (i) commit changing product code -> JIT, (ii) commit adding tests. Should I dig up your specific request?

Yes, two commits (which is exactly the shape of the current PR). I may have misread your original comment as I thought you were asking why this isn't split into two PRs.

4creators · 2018-05-24T15:10:00Z

@tannergooding Ahh, OK, I see how this issue arose.

tannergooding · 2018-05-24T16:50:07Z

Ok, test failures are all in the no AVX runs.

FMA failures are because the ISA isn't filtered out when AVX is disabled (which, today, basically controls the VEX support that FMA requires).

The remaining failures look to have been introduced by #16517 (the last non-PR run doesn't yet include my most recent changes to BuildHWIntrinsic: https://ci.dot.net/job/dotnet_coreclr/job/master/job/jitstress/job/x86_checked_windows_nt_jitx86hwintrinsicnoavx/130/).

They all look to be similar in error: Assertion failed '(consume == 0) || (ComputeAvailableSrcCount(tree) == consume)' -- CC. @CarolEidt

tannergooding · 2018-05-24T17:11:28Z

Rebased onto master to pick up #18116
Fixed up the compiler to no longer enable FMA if the global AVX flag is also disabled.
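The dependency described above (FMA needs VEX support, which the global AVX flag controls) amounts to masking the individual ISA switch with the global one. A rough C++ sketch of that idea follows. The struct, field, and function names are hypothetical and do not reflect the actual JIT code or its configuration names.

```cpp
#include <cassert>

// Hypothetical configuration: a global AVX switch plus an individual FMA switch.
struct IsaConfig
{
    bool enableAVX; // global switch; also stands in for VEX-encoding support
    bool enableFMA; // individual ISA switch
};

// Hypothetical result of resolving the switches into usable ISAs.
struct EnabledIsas
{
    bool avx;
    bool fma;
};

// FMA is reported as usable only when the global AVX switch is on as well.
EnabledIsas ResolveIsas(const IsaConfig& config)
{
    EnabledIsas result;
    result.avx = config.enableAVX;
    result.fma = config.enableAVX && config.enableFMA;
    return result;
}

// Hypothetical check mirroring the "no AVX" test runs: FMA must be filtered out.
void CheckNoAvxDisablesFma()
{
    const IsaConfig noAvx = {false, true};
    assert(!ResolveIsas(noAvx).fma);
}
```

With this shape, a run that disables AVX never sees FMA as supported, regardless of how the individual FMA switch is set.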

tannergooding · 2018-05-24T20:13:40Z

Logged https://github.com/dotnet/coreclr/issues/18119 for the remaining, unrelated, failures.

tannergooding · 2018-05-24T21:22:46Z

PR resolving the other test failures is here: #18120

tannergooding · 2018-05-24T23:28:13Z

@dotnet-bot test Windows_NT x64 Checked jitincompletehwintrinsic
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnosimd
@dotnet-bot test Windows_NT x64 Checked jitnox86hwintrinsic

@dotnet-bot test Windows_NT x86 Checked jitincompletehwintrinsic
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnosimd
@dotnet-bot test Windows_NT x86 Checked jitnox86hwintrinsic

@dotnet-bot test Ubuntu x64 Checked jitincompletehwintrinsic
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnosimd
@dotnet-bot test Ubuntu x64 Checked jitnox86hwintrinsic

tannergooding · 2018-05-24T23:28:38Z

@CarolEidt, all issues should now be resolved.

Jorenkv · 2018-05-25T06:40:32Z

@tannergooding Ah of course. You're not doing software fallback so then there's no point in checking if the computation is precisely correct.

CarolEidt

I've looked through all the code, and have just a couple of comment/naming suggestions - as well as a couple of things to consider for future.
I plan to look at the encodings (instrxarch.h) just to double-check against the arch manual.

CarolEidt · 2018-05-25T15:08:32Z

src/jit/compiler.cpp

@@ -2606,6 +2582,34 @@ void Compiler::compSetProcessor()
                }
            }
        }
+
+        // There are currently two sets of flags that control AVX, FMA, and AVX2 support
+        // This is the general EnableAVX flag and the individual ISA flags. We need to


Nit: I would end the first line with ':' and start the second line with "These are ..."

CarolEidt · 2018-05-25T16:15:08Z

src/jit/emitxarch.cpp

@@ -5360,12 +5455,85 @@ void emitter::emitIns_SIMD_R_R_I(instruction ins, emitAttr attr, regNumber reg,
    }
 }

+void emitter::emitIns_SIMD_R_R_R_A(
+    instruction ins, emitAttr attr, regNumber reg, regNumber reg1, regNumber reg2, GenTreeIndir* indir)


In this method (and the subsequent ones), I would name the first reg argument targetReg or dstReg or something. Unlike some of the other methods, this one is designed only to support the case where the first argument is the dest, so it would be good to be descriptive.

Can I submit a separate PR fixing up all the register names here? Currently the majority are regNumber reg where it should be regNumber targetReg and it would be nice to fix them all up (I could also do it in this PR, if you think that is fine).

CarolEidt · 2018-05-25T16:19:01Z

src/jit/gentree.cpp

@@ -17422,6 +17422,16 @@ bool GenTree::isRMWHWIntrinsic(Compiler* comp)
    switch (AsHWIntrinsic()->gtHWIntrinsicId)
    {
        case NI_SSE42_Crc32:
+        case NI_FMA_MultiplyAdd:


It seems to me that perhaps we should just use a flag for this - especially if there are more intrinsics that will have this behavior. As I think I've mentioned before I don't like negative booleans - so I would prefer changing the existing flag to something like HW_Flag_SSE_RMWSemantics and then adding HW_Flag_RMWSemantics. But that is a topic for another discussion.

> It seems to me that perhaps we should just use a flag for this - especially if there are more intrinsics that will have this behavior.

I don't think we have that many of them (there are very few RMW intrinsics for the VEX encoding). I'll add a comment indicating we should revisit if this grows any more.

> As I think I've mentioned before I don't like negative booleans - so I would prefer changing the existing flag to something like HW_Flag_SSE_RMWSemantics and then adding HW_Flag_RMWSemantics. But that is a topic for another discussion.

Right, I think we have an issue tracking this. -- I think the primary problem right now is that the majority case is positive, and it is easier to track/add the minority case (whether positive or negative).

Added a TODO-XArch-Cleanup comment here.

CarolEidt · 2018-05-25T16:21:53Z

src/jit/hwintrinsiccodegenxarch.cpp

+            varNum = tmpDsc->tdTempNum();
+            offset = 0;
+
+            compiler->tmpRlsTemp(tmpDsc);


Another possible comment for future: if there isn't a method to do this (I couldn't find it off-hand), perhaps there should be (returning a varNum):

    tmpDsc = getSpillTempDsc(op3);
    varNum = tmpDsc->tdTempNum();
    compiler->tmpRlsTemp(tmpDsc);

Added a TODO-XArch-Cleanup comment

CarolEidt · 2018-05-25T16:36:01Z

@fiigii - would you have time to also have a look at this?

CarolEidt

LGTM
Thanks! And sorry it took me a while to get to this.

fiigii · 2018-05-25T22:42:54Z

@CarolEidt Sorry, I have no time to look at the PR in detail until 6/10. Please feel free to move forward if it looks good to you.

CarolEidt · 2018-05-25T23:14:52Z

@fiigii - no problem, but thanks for the quick response.

CarolEidt reviewed May 25, 2018

CarolEidt approved these changes May 25, 2018

tannergooding added 3 commits May 25, 2018 10:08

Updating the JIT to handle the FMA hardware intrinsics.

7f0b650

Adding tests for the FMA hardware intrinsics.

0eeade9

Updating the compiler to not enable FMA if the global AVX flag is disabled.

850716a

tannergooding merged commit 8db778b into dotnet:master May 25, 2018

tannergooding deleted the hwintrin-fma branch May 30, 2018 04:15

tannergooding mentioned this pull request Jan 31, 2020

HWIntrinsic failures for the RO tests when AVX disabled dotnet/runtime#10377

Closed
